@chinmaydk99

This PR fixes critical bugs in PyTorch Inductor's multi-kernel support for ROCm/HIP, enabling dynamic shape-based kernel selection for Triton templates. The changes resolve both compilation failures and runtime crashes when using config.multi_kernel_hints.

UTs being targeted:

  1. test/inductor/test_multi_kernel.py - test_triton_gemm
  2. test/inductor/test_multi_kernel.py - test_triton_relu_fused_gemm
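A minimal sketch of exercising the targeted path with `config.multi_kernel_hints` follows; the hint values and the autotune backend setting are assumptions for illustration, not the PR's test code:

```python
import torch
import torch._inductor.config as inductor_config

# Placeholder hint sizes; the PR only states that config.multi_kernel_hints
# drives dynamic shape-based kernel selection for Triton templates.
inductor_config.multi_kernel_hints = [64, 256, 4096]
inductor_config.max_autotune_gemm_backends = "TRITON"  # assumed, to force Triton templates

@torch.compile(mode="max-autotune")
def relu_gemm(a, b):
    return torch.relu(a @ b)

if torch.cuda.is_available():  # covers ROCm/HIP builds as well
    a = torch.randn(256, 256, device="cuda", dtype=torch.float16)
    b = torch.randn(256, 256, device="cuda", dtype=torch.float16)
    out = relu_gemm(a, b)
```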

pytorchmergebot and others added 30 commits November 18, 2025 17:20
…7661)"

This reverts commit 1b43d6c.

Reverted pytorch#167661 on behalf of https://github.com/yangw-dev due to break internal tests and build, please reach out meta fellas to have fix it and reland again, error example: hip/KernelUtils.cuh:74:5: error: no matching function for call to 'unsafeAtomicAdd' ([comment](pytorch#167661 (comment)))
Summary: The export_memory_timeline method in torch.profiler is being deprecated in favor of the newer memory snapshot API (torch.cuda.memory._record_memory_history and torch.cuda.memory._export_memory_snapshot). This change adds the deprecated decorator from typing_extensions and updates the docstring to guide users to the recommended alternative. The decorator will emit a FutureWarning at runtime, and the docstring now includes a .. deprecated:: directive for documentation visibility.
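For readers migrating off export_memory_timeline, a short sketch of the recommended snapshot API named above (output file name is a placeholder):

```python
import torch

# Start recording allocator events (stack traces are captured by default).
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run the workload whose memory usage you want to inspect ...
x = torch.randn(1024, 1024, device="cuda")
y = x @ x

# Dump the snapshot (viewable at pytorch.org/memory_viz), then stop recording.
torch.cuda.memory._export_memory_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```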

Test Plan: Manual verification that the decorator is properly applied and the deprecation message is informative.

Differential Revision: D87272399

Pull Request resolved: pytorch#168036
Approved by: https://github.com/valentinandrei
This PR introduces a `Tensor` subclass which represents a complex tensor in terms of two real ones. Ops are decomposed as individual ops  on the real and imaginary parts.

It is compatible with `torch.compile`, so long as the real ops used are also compatible. Autograd "works", but is WIP due to different edge-case behaviour.
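As a rough illustration of the decomposition idea (not the PR's actual subclass or machinery), complex multiplication can be expressed purely as real-tensor ops:

```python
import torch

class TwoRealComplex:
    """Toy stand-in: a complex value held as two real tensors."""

    def __init__(self, re: torch.Tensor, im: torch.Tensor):
        self.re, self.im = re, im

    def __mul__(self, other: "TwoRealComplex") -> "TwoRealComplex":
        # (a + bi)(c + di) = (ac - bd) + (ad + bc)i, using only real ops,
        # so a compiler tracing this only ever sees real-tensor operations.
        return TwoRealComplex(
            self.re * other.re - self.im * other.im,
            self.re * other.im + self.im * other.re,
        )

a = TwoRealComplex(torch.randn(3), torch.randn(3))
b = TwoRealComplex(torch.randn(3), torch.randn(3))
c = a * b
```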
Pull Request resolved: pytorch#167621
Approved by: https://github.com/ezyang
Repetition of pytorch#155708
It has been broken for a while, and the ET pin in PyTorch is so old that `torch==2.10.0.dev20250915` could no longer be found in the nightly indices.
Pull Request resolved: pytorch#168090
Approved by: https://github.com/atalman, https://github.com/yangw-dev
This PR enables special matmuls on Thor devices. This includes row-wise scaled matmul on `fp8` and group gemm on `bfloat16`.
Pull Request resolved: pytorch#164836
Approved by: https://github.com/ngimel
…orch#167395)

This adds a debug HTTP server for debugging stuck or slow jobs. It runs the WorkerServer on every worker and then launches a separate Flask process on rank 0 for users to connect to for debugging.

This can easily be improved to trigger profilers as well as visualize the data much better.

Initial handlers:
* pytorch profiler
* FlightRecorder data
* Python stacks

```
import os

# Size the NCCL FlightRecorder trace buffer so the debug server has data to serve.
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"

from torch.distributed.debug import enable_debug_server

enable_debug_server()
```

Test plan:

```
torchrun --nnodes 1 --nproc_per_node=gpu ~/scripts/debug_test.py
```

<img width="2000" height="1045" alt="20251117_16h58m18s_grim" src="https://github.com/user-attachments/assets/82305b75-227c-4412-a481-00b622db6bd1" />
<img width="2172" height="1624" alt="20251117_16h58m11s_grim" src="https://github.com/user-attachments/assets/def9841c-c7e6-483a-81c3-cf0c56f6bad8" />
<img width="1985" height="1635" alt="20251117_16h58m03s_grim" src="https://github.com/user-attachments/assets/04fcf148-df58-41b4-8754-8706ee0d1de6" />

Pull Request resolved: pytorch#167395
Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/atalman
…ytorch#167079)

Summary:
As title.

Knowing the size of the leaked tensor is useful, it allows us to focus on the largest leaks.

Differential Revision: D86218574

Pull Request resolved: pytorch#167079
Approved by: https://github.com/kausv
…torch#161703)

It's another PR to port the distributed tensor tests to Intel GPU; the other PR is pytorch#161604.
We enable Intel GPU with the following methods while trying our best to keep the original code style (a brief sketch follows this list):

- Use torch.accelerator for generic GPU handling
- Skip cases with known issues when running on XPU
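A hedged sketch of that pattern; the test body is a placeholder, not one of the ported tests:

```python
import unittest
import torch

# torch.accelerator abstracts over CUDA/XPU/etc.; fall back to CPU when absent.
acc = torch.accelerator.current_accelerator() if torch.accelerator.is_available() else None
device_type = acc.type if acc is not None else "cpu"

class DeviceGenericSmokeTest(unittest.TestCase):
    @unittest.skipIf(device_type == "xpu", "known issue on XPU")
    def test_add(self):
        x = torch.ones(4, device=device_type)
        torch.testing.assert_close(x + x, torch.full((4,), 2.0, device=device_type))
```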

Pull Request resolved: pytorch#161703
Approved by: https://github.com/guangyey, https://github.com/d4l3k, https://github.com/albanD
The all-gather bucketing already went part of the way toward fusing dtype casts into the bucket. We do this by allocating the group bucket buffer, then viewing each slice of it as the destination dtype. We then foreach_copy_ into the allocated buffer, with each collective copying into its destination dtype.

This logic was causing an issue in a later part of the stack but was not fully firing, so we might as well fix it.

Note: custom ops don't yet support list[dtype], so I worked around it with list[int]; this will be fixed in a follow-up.
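A standalone sketch of the buffer-view-copy pattern described above (shapes, dtypes, and the byte-typed bucket are assumptions; the real code operates on collective inputs inside the bucketing pass):

```python
import torch

inputs = [torch.randn(16, dtype=torch.float32), torch.randn(8, dtype=torch.float32)]
dest_dtypes = [torch.bfloat16, torch.float16]

# Size one flat byte buffer so each slice can be reinterpreted as its destination dtype.
nbytes = [t.numel() * d.itemsize for t, d in zip(inputs, dest_dtypes)]
bucket = torch.empty(sum(nbytes), dtype=torch.uint8)

views, offset = [], 0
for t, d, nb in zip(inputs, dest_dtypes, nbytes):
    views.append(bucket[offset:offset + nb].view(d))  # view the byte slice as the destination dtype
    offset += nb

# One fused copy; each underlying copy_ casts its input into the destination dtype.
torch._foreach_copy_(views, inputs)
```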

Pull Request resolved: pytorch#167853
Approved by: https://github.com/ruisizhang123
ghstack dependencies: pytorch#167852
The bucketing dtype fusing was causing nodes which had dependencies to be erased. Transfer those deps over to the new nodes, and also add an assertion that none of our deps are erased to catch this type of error in the future.

Pull Request resolved: pytorch#167863
Approved by: https://github.com/fmassa
ghstack dependencies: pytorch#167852, pytorch#167853
Since the currently intended workflow on the new MI3xx CI capacity is [trunk-rocm-mi300.yml](https://github.com/pytorch/pytorch/blob/d91269e8ce309437c1f849b5ab3362d69b178ef4/.github/workflows/trunk-rocm-mi300.yml#L54), which only needs the jammy images, we limit docker caching to those images to optimize caching times.

Pull Request resolved: pytorch#168088
Approved by: https://github.com/jeffdaily
For GPU: It was previously reported that only a single sample could be tested with the huber_loss functional. The current snapshot of the code does not appear to suffer from the numerical issues reported before.

For CPU: While testing GPU, it was discovered that the Half computation appears to be numerically unstable. This commit resolves the CPU issue by upcasting Half to float for the computation.
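A hedged, eager-mode sketch of the upcast-for-compute idea (this is not the ATen kernel change itself):

```python
import torch

def huber_loss_half_stable(inp: torch.Tensor, target: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    # Do the arithmetic in float32 when the inputs are Half, then cast the result back.
    compute_dtype = torch.float32 if inp.dtype == torch.half else inp.dtype
    x, y = inp.to(compute_dtype), target.to(compute_dtype)
    diff = (x - y).abs()
    loss = torch.where(diff < delta, 0.5 * diff * diff, delta * (diff - 0.5 * delta))
    return loss.mean().to(inp.dtype)

a = torch.randn(32, dtype=torch.half)
b = torch.randn(32, dtype=torch.half)
print(huber_loss_half_stable(a, b))
```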

Pull Request resolved: pytorch#166952
Approved by: https://github.com/benjaminglass1, https://github.com/isuruf
…h/csrc/Exceptions.h (pytorch#168056)

Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: dtolnay

Differential Revision: D87273132

Pull Request resolved: pytorch#168056
Approved by: https://github.com/malfet, https://github.com/Skylion007
Summary: If the Tensor has a PyObject, its use count will now be two instead of one.

Test Plan: `buck test -j 18 fbcode//mode/dev-nosan fbcode//caffe2/test:torch`

Differential Revision: D87297965

Pull Request resolved: pytorch#168060
Approved by: https://github.com/albanD, https://github.com/Skylion007
As the compiler has not been supported for the last 3 years, and all manylinux2_28 builds should have at least gcc-11.

Prep change for C++20 standard migration
Pull Request resolved: pytorch#167933
Approved by: https://github.com/yangw-dev, https://github.com/atalman
ghstack dependencies: pytorch#168090
…rch#168104)

We only want to cache the latest CI docker image for the `main` and `release` branches in cases where multiple `docker-builds` workflow runs get triggered in quick succession. This is because the latest run will overwrite the cached images anyway, since we do not maintain a cached image per SHA; there is only one per branch (to minimize cache size and docker load times at runner bringup).

Also removing `workflow_dispatch` as a trigger since it won't work (needs artifacts from `docker-builds` run)

Pull Request resolved: pytorch#168104
Approved by: https://github.com/jeffdaily
Fixes pytorch#167905

The typo correction below has been made.

Existing comment:
// List of Any can contains heterogenous types

Suggested comment:
// List of Any can contains heterogeneous types
Pull Request resolved: pytorch#167907
Approved by: https://github.com/albanD
Unclear which PR in the ghstack caused the ROCm failure. Stack was (oldest at bottom):
 - pytorch#167962
 - pytorch#167804
 - pytorch#167803
 - pytorch#167802
 - pytorch#168025

Fixes the following test:

```
PYTORCH_TEST_WITH_ROCM=1 python test/cpp_extensions/libtorch_agnostic_2_10_extension/test_version_compatibility.py FunctionVersionCompatibilityTest.test_mv_tensor_accessor_cuda_works_with_2_9
```

Pull Request resolved: pytorch#168087
Approved by: https://github.com/jeffdaily, https://github.com/janeyx99

Co-authored-by: Jeff Daily <[email protected]>
Co-authored-by: Jane (Yuan) Xu <[email protected]>
Fixes a false negative (the illusion that "all B200 periodic nvshmem-triton tests passed").

Pull Request resolved: pytorch#167760
Approved by: https://github.com/ngimel
# Motivation
This is definitely a bug: we were attempting to release cached memory back to the system without proper **synchronization**. Callers must ensure that all accesses to memory blocks allocated by SYCL APIs have completed before invoking `sycl::free`.

For a simple example, in the following code:
```python
pool = torch.xpu.MemPool()
with torch.xpu.use_mem_pool(pool):
    input = torch.randn(100, device='xpu')
sum = input.sum()
del pool
print(sum)
```
`sum` may exhibit undefined behavior because `input.sum()` might not have finished executing before `del pool` triggers `input`'s memory release.

With this fix, we ensure that all kernels on the associated streams complete before the memory pool is destroyed, guaranteeing that `sum` holds the correct value.
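With the fix, the snippet above prints the expected value; on builds without it, the same guarantee can be obtained manually (a sketch reusing the example above):

```python
import torch

pool = torch.xpu.MemPool()
with torch.xpu.use_mem_pool(pool):
    input = torch.randn(100, device="xpu")
total = input.sum()
torch.xpu.synchronize()  # ensure input.sum() has finished before the pool's blocks are freed
del pool
print(total)
```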

# Solution
Because `c10::xpu::syncStreamsOnDevice` has host overhead, we use a boolean flag `streams_synced` to ensure it is called only once.
Pull Request resolved: pytorch#168074
Approved by: https://github.com/EikanWang
…6833)

The implementation plan of MemPool for XPU, which is a dependency of [XPUGraph](pytorch#166285), following the [RFC](pytorch#162143).

- [ ] pytorch#166831
- [ ] ->pytorch#166833
- [ ] pytorch#166843
Pull Request resolved: pytorch#166833
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
Summary: Fix pytorch#167630. There was a reference cycle between GraphLowering and CppWrapperCpu due to caching, which made GraphLowering unnecessarily hold some constant tensors, causing GPU memory leaks. This PR fixes that by changing the cache to use the object id of GraphLowering as part of the key.
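A minimal sketch of the keying change (class and function names here are placeholders, not the actual Inductor code):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def _codegen_cached(graph_id: int, key: str) -> str:
    # Expensive work keyed by the owning graph's id() rather than the graph object
    # itself, so the cache no longer keeps GraphLowering (and its constant tensors) alive.
    return f"// generated for graph {graph_id}: {key}"

class WrapperCodegen:
    def codegen(self, graph, key: str) -> str:
        return _codegen_cached(id(graph), key)
```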

Pull Request resolved: pytorch#168063
Approved by: https://github.com/yushangdi
Fixes #ISSUE_NUMBER

Pull Request resolved: pytorch#168111
Approved by: https://github.com/ezyang
…torch#166273)

Partially vibe-coded with ClaudeCode, and changes following ops (summary also created by Claude):
- **Activation operations**: Added checks rejecting Long, Complex, and Bool types for operations like log_softmax, log_sigmoid, mish, softplus, and silu, as MPS doesn't support exponent operations on these types

- **Linear algebra operations**: Restricted linalg_lu_factor, linalg_solve, and linalg_solve_triangular to Float type only (previously only checked for complex types)

- **Pooling operations**: Added checks to reject Complex types for avg_pool2d and max_pool2d operations

- **Loss functions**: Added type checks for nll_loss (Complex), huber_loss (Long, Complex), and grid_sampler_2d (Complex)

- **Reduction operations**:
  - Fixed NANSUM to handle integral types correctly (can't contain NaN, so just performs regular sum)
  - Added Long type check for std/var operations

- **Other operations**:
  - softmax: Now explicitly requires floating point types
  - bincount: Rejects Bool type to prevent crashes

All checks use `TORCH_CHECK_NOT_IMPLEMENTED`
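The user-visible effect (a hedged sketch; exact error type and message may differ) is an explicit not-implemented failure instead of a crash, e.g. for bincount on Bool:

```python
import torch

if torch.backends.mps.is_available():
    x = torch.ones(8, dtype=torch.bool, device="mps")
    try:
        torch.bincount(x)  # Bool input is now rejected up front
    except (NotImplementedError, RuntimeError) as err:
        print("rejected:", err)
```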
Pull Request resolved: pytorch#166273
Approved by: https://github.com/manuelcandales
Summary: Shrink binary size to reduce relocation overflows. The most important change is to split `intrusive_ptr::reset_()` into two functions and mark the bigger one as `C10_NOINLINE`.

Differential Revision: D87308588

Pull Request resolved: pytorch#168080
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/malfet, https://github.com/ezyang
# Motivation
Thanks to @KarhouTam for finding the issue mentioned in pytorch#167172
This PR aims to improve the build logic in activities for kineto.

# Additional Context
Fix pytorch#167172

Pull Request resolved: pytorch#167204
Approved by: https://github.com/EikanWang, https://github.com/ezyang
pytorchupdatebot and others added 17 commits November 19, 2025 06:11
This PR adds support for effectful ops within invoke_subgraphs.
* Most of the logic is in `invoke_subgraph.py_functionalize_impl`.
  * In the functionalization metadata collection phase, we note the tokens before going further down the dispatcher, and then note the tokens after coming back from the dispatcher. If there are nodes in the invoke_subgraph subgraph that contain effects, the number of effects should change, or the tokens used for an effect should.
  * We will store this effect difference in the `InvokeSubgraphCache` where the key is the identifier and value is the effect. For now we only support one effect within a subgraph.
  * During the tracing part of AOTAutograd, we will then wrap the subgraph to take in and output a token.

Before:
```
def forward(self, x):
    repeated_subgraph0 = self.repeated_subgraph0
    invoke_subgraph = torch.ops.higher_order.invoke_subgraph(repeated_subgraph0, 'subgraph_0', x)
    return invoke_subgraph

def repeated_subgraph(self, x):
    record_memory = torch.ops.mylib.record_memory.default("forward", "N")
    add = torch.ops.aten.add(x, x)
    return add
```
After:
```
def forward(self, token, x):
    repeated_subgraph0 = self.repeated_subgraph0
    invoke_subgraph = torch.ops.higher_order.invoke_subgraph(repeated_subgraph0, 'subgraph_0', token, x)
    getitem = invoke_subgraph[0]  # output token
    getitem_1 = invoke_subgraph[1]
    return (getitem, getitem_1)

def repeated_subgraph(self, token, x):
    with_effects = torch.ops.higher_order.with_effects(token, torch.ops.mylib.record_memory.default, 'forward', 'N')
    getitem = with_effects[0]  # output token
    add = torch.ops.aten.add(x, x)
    return  (getitem, add)
```

* Then there is a bunch of logic within `_remove_effect_tokens` to handle removing the effects from the invoke_subgraph subgraph

Differential Revision: [D87392741](https://our.internmc.facebook.com/intern/diff/D87392741)
Pull Request resolved: pytorch#167231
Approved by: https://github.com/anijain2305
…torch#167245)

In the [previous PR](https://github.com/pytorch/pytorch/pull/167231/files#diff-e2b74af5d8b538a7d07d18507d27010703742ddad5f819992b55f5abc6d9a502R964-R966) we found that the autograd eager impl of invoke_subgraph calls the subgraph twice. If the subgraph contains effects then effects will be run twice, which is bad. This PR fixes the issue by getting the output metadata from `subgraph`'s `node.meta` if it exists.

Differential Revision: [D87392740](https://our.internmc.facebook.com/intern/diff/D87392740)
Pull Request resolved: pytorch#167245
Approved by: https://github.com/anijain2305
ghstack dependencies: pytorch#167231
`fully_shard`'s `gradient_divide_factor` isn't currently respected when the sharding degree = 1. This PR ensures the division factor applies also in this case.

This is a bit of an edge case, but it arises in `torchtitan`, e.g. with expert parallelism and `ep_degree=world_size` we still wrap the routed experts in `fully_shard` because:
1) It lets us take advantage of its mixed-precision mechanisms.
2) [A specific gradient_divide_factor is needed for correctness](https://github.com/pytorch/torchtitan/blob/176498cd4edd4d80e95959a618279681f8295f4c/torchtitan/models/llama4/infra/parallelize.py?plain=1#L364-L369)

This PR ensures correctness in the `reduce_scatter_group.size()==1` case.

Reproducer and sample failures are in the [gist here](https://gist.github.ibm.com/goon/f67e7559284cc2d322faff1ac59fe382). The net effect is that the EP grads are too large by a factor of the world size in the case described above. I checked that the proposed fix makes these tests pass.

I guess I should add a test for this, too?

Pull Request resolved: pytorch#167178
Approved by: https://github.com/weifengpy
…efs_fft_irfftn_cuda_complex64` (pytorch#168016)

Otherwise we see e.g.,
```
Mismatched elements: 1 / 40320 (0.0%)
Greatest absolute difference: 0.0001373291015625 at index (0, 4, 0, 2, 3, 5) (up to 0.0001 allowed)
Greatest relative difference: 1.633889951335732e-05 at index (0, 4, 0, 2, 3, 5) (up to 1.3e-06 allowed)
```

Pull Request resolved: pytorch#168016
Approved by: https://github.com/nWEIdia, https://github.com/ezyang
This is the necessary fix for meta-pytorch/autoparallel#256.

### Issue:
When we call `_clear_fast_path_sharding_prop_cache()` and then `get_thread_local_native_sharding_propagator_cache()`, the code gets stuck due to a deadlock.

### Cause:
When you assign to a Python dict key that already exists:
```C++
thread_dict["__DTensor_fastpath_thread_cache_cleanup"] = old_capsule  // capsule ROCm#1 stored
...
clear_DTensor_sharding_propagator_cache() // call to clean up the cache
...
get_thread_local_native_sharding_propagator_cache() {
  std::lock_guard<std::mutex> lock(
        native_sharding_propagator_cache_cleanup_mutex);  // FIRST claims the lock!
  if (!native_sharding_propagator_cache_DO_NOT_USE.has_value()) { // enter this again because we have cleared the cache.
    ...
    // Destroys old_capsule FIRST then stores new_capsule. However, where we destroy the old_capsule,
    // it will trigger the destructor to claim `native_sharding_propagator_cache_cleanup_mutex` again!
    thread_dict["__DTensor_fastpath_thread_cache_cleanup"] = new_capsule  // SECOND claims the lock before FIRST releases
  }
}
```
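The same re-entrant-destructor pattern, sketched in Python with a non-reentrant lock (purely illustrative; the actual code is the C++ above):

```python
import threading

lock = threading.Lock()          # non-reentrant, like std::mutex
thread_dict = {}

class CleanupCapsule:
    def __del__(self):
        with lock:               # the destructor claims the same lock
            pass

def rebuild_cache():
    with lock:                                   # FIRST claims the lock
        # Assigning over an existing key destroys the old capsule while the lock
        # is still held; its __del__ then tries to claim the lock again.
        thread_dict["cleanup"] = CleanupCapsule()

rebuild_cache()    # fine the first time: there is no old capsule to destroy
# rebuild_cache()  # would deadlock: the old capsule's __del__ blocks on the held lock
```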

Pull Request resolved: pytorch#168069
Approved by: https://github.com/ezyang
# Motivation
This PR aims to fix the bug that the moved-to object's `external_` member is not assigned correctly.

# Additional Context
It's not correct to swap the valid and invalid values; we just need to prevent a double free.

Pull Request resolved: pytorch#167711
Approved by: https://github.com/albanD
Adds support for randomness in LocalTensor. tl;dr: it needs to be able to handle RNG the same way (i.e., RNG tracking/syncing across shards, user-defined seeds, user-defined generators, etc.).

We extend the existing OffsetBasedRNGTracker to play nicely with LocalTensor's setup, creating a few small subclasses and patching the core RNG logic to manage the per-rank seeds and offsets correctly.

Per-rank generator support is still not done (since the existing tests imply a globally-seeded generator), but it is something that should be done.

Pull Request resolved: pytorch#166540
Approved by: https://github.com/dzmitry-huba
…rch#166395)

Implement a complete OpenRegDeviceAllocator with the following enhancements:

- Implement memory statistics tracking (allocated/reserved bytes, allocation count)
   - Track allocation sizes for accurate memory statistics
- Refactor DeviceAllocator's inheritance relationship from c10::DeviceAllocator
  - This change is for further improvement of adding a memory caching function to DeviceMemory

Add comprehensive test coverage:
- Memory allocation/deallocation tests with statistics validation
- Storage operations and tensor-from-blob tests
- Multithreading safety tests for concurrent allocations
- Gradient tracking and requires_grad compatibility tests

Fixes pytorch#166157

Pull Request resolved: pytorch#166395
Approved by: https://github.com/fffrog
…extension (pytorch#167855)

Address Nikita's offline comment on pytorch#167496

Pull Request resolved: pytorch#167855
Approved by: https://github.com/janeyx99
ghstack dependencies: pytorch#167496
…#168049)"

This reverts commit 8cb8b6c.

Reverted pytorch#168049 on behalf of https://github.com/yangw-dev due to D87346992 internal error that conflict the main branch, please rebase and try to merge again These changes have conflicts when merging with master branch. Rebase this diff. ([comment](pytorch#168049 (comment)))
…)"

This reverts commit db1551b.

Reverted pytorch#168024 on behalf of https://github.com/yangw-dev due to Internal merge fail, These changes have conflicts when merging with master branch. Rebase this diff. please rebase the pr and try merge again ([comment](pytorch#168024 (comment)))
…cts (pytorch#168149)

grad_placements is a sequence-like data structure and can therefore be a UserDefinedObject. In that case, we can extract the tuple and pass it along.

Pull Request resolved: pytorch#168149
Approved by: https://github.com/bdhirsh